Viewing functions as token sequences to highlight similarities in source code

Authors

  • Michel Chilowicz
  • Étienne Duris
  • Gilles Roussel
Abstract

The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem, since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique based not only on token sequence matching but also on the factorization of the function call graphs. The factorization process merges shared chunks (factors) of code to cope, in particular, with inlining and outlining. The resulting call graph offers a view of the similarities with their nesting relations. It is useful for inferring metrics that quantify similarity at the function level.
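To make the idea of token sequence matching concrete, here is a minimal sketch in Python. It is not the authors' technique (in particular, it performs no call graph factorization); it only illustrates how viewing functions as token sequences makes a comparison robust to identifier renaming. The tokenizer, the keyword set, the k-gram size and the Jaccard measure are all assumptions chosen for illustration.

```python
import re

# Hypothetical, simplified lexer: identifiers/keywords, integers, or any
# single non-space character. A real tool would use a language-aware lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}  # assumed subset


def tokenize(source):
    """Turn source text into tokens, abstracting identifiers and numbers."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")   # every identifier maps to the same token
        elif lexeme[0].isdigit():
            tokens.append("NUM")  # every integer literal maps to the same token
        else:
            tokens.append(lexeme)
    return tokens


def kgrams(tokens, k):
    """Return the set of all contiguous token subsequences of length k."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(src_a, src_b, k=4):
    """Jaccard similarity between the token k-gram sets of two functions."""
    a = kgrams(tokenize(src_a), k)
    b = kgrams(tokenize(src_b), k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x, int y) { return x + y; }"
print(similarity(original, renamed))
```

Because identifiers are abstracted before comparison, the two functions above yield identical token sequences. A set-based measure like this one ignores where the shared chunks occur and how they nest, which is the kind of information the factorized call graph described in the abstract is meant to expose.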

Similar resources

Code Similarity Detection in Multiple Large Source Trees using Token Hashes

The ability to find similarities between two source code bases, or within one code base, has many uses including the detection of student plagiarism, the identification of intellectual property violations and the location of repeated code in a code base amenable to refactoring. Previous structure-metric approaches have used either suffix trees or modified Longest Common Subsequence algorithms t...

An Investigation into the Characteristics of Merged Code Clones during Software Evolution

Although code clones (i.e. code fragments that have similar or identical code fragments in the source code) are regarded as a factor that increases the complexity of software maintenance, tools for supporting clone refactoring (i.e. merging a set of code clones into a single method or function) are not commonly used. To promote the development of refactoring tools that can be more widely utiliz...

Localizing Latent Software Faults Using Cross Entropy and n-gram Models

The aim is to automate the process of bug localization in program source code. The cause of program failure could be best determined by comparing and analyzing correct and incorrect execution paths generated by running the instrumented program with different failing and passing test cases. To compare and analyze the execution paths, one approach is clustering the paths according to their simil...

Code Similarity Comparison of Multiple Source Trees

This paper outlines the design of a code comparison tool, ctcompare, which uses short sequences of lexical tokens from source code as a key in an inverted index to perform the code comparison. This technique allows the comparison of multiple source code trees simultaneously. Other significant features of the tool include the definition of a serialised token stream format which allows the indepen...

Late propagation of Type-3 Clones

Type-3 clones are duplicated source code fragments that span two or more identical sequences of tokens (whitespace and comments are ignored) that form a contiguous source code fragment interrupted by nonidentical token sequences. Several studies on the evolution of code clones have been conducted to detect patterns that can help to manage clones [3,6]. One of those patterns that is assumed to b...

Journal:
  • Sci. Comput. Program.

Volume 78

Publication date: 2013